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ABSTRACT: This and a companion paper propose techniques for constructing parametric mathematical mod- 

els describing key features of the distribution of an output variable given input-output data. By contrast to stan- 
dard models, which yield a single output value at each value of the input, Random Predictors Models (RPMs) 
yield a random variable at each value of the input. Optimization-based strategies for calculating RPMs having a 
polynomial dependency on the input and a linear dependency on the parameters are proposed. These formula- 
tions yield RPMs having various levels of fidelity in which the mean, the variance, and the range of the model’s 
parameter, thus of the output, are prescribed. As such they encompass all RPMs conforming to these prescrip- 
tions. The RPMs are optimal in the sense that they yield the tightest predictions for which all (or, depending on 
the formulation, most) of the observations are less than a fixed number of standard deviations from the mean 
prediction. When the data satisfies mild stochastic assumptions, and the optimization problem(s) used to calcu- 
late the RPM is convex (or, when its solution coincides with the solution to an auxiliary convex problem), the 
model’s reliability, which is the probability that a future observation would be within the predicted ranges, is 
bounded rigorously. 


1 INTRODUCTION 

It is assumed the reader is familiar with the content 
of the companion paper (Crespo et al. 2015). The in- 
troduction, and literature review therein apply to this 
paper as well but have been omitted here due space 
limitations. Whereas (Crespo et al. 2015) focuses on 
Type- 1 and Type-2 RPMs, this paper focuses on Type- 
3 and Type-4 RPMs. The overlap between the papers 
has been kept to a minimum. 

2 PROBLEM STATEMENT 

A system is postulated to act on a vector of input vari- 
ables x to produce an output y. The output can de- 
pend on the state variables and on some other influ- 
ences, causing, for instance, intrinsic variability. Let 
X C M n;c be a set of input variables, and Y C be a 
set of outputs which might result from the system act- 
ing on elements of X. In the following, the focus will 
be on the single-output (n y = 1) multi-input (n x > 1) 
case. In this setting the two main problems of interest 
can be stated as follows. Let z = {^} = {(a;*, ?/*)}, for 
i — 1, . . . , N, be a sequence of observations generated 
by a Data Generating Mechanism (DGM). First, we 
want to find an empirical model that, when evaluated 
at a new value x^ + \ of the state, returns an informa- 
tive prediction of the unobserved output yN+i ■ An in- 


formative prediction can be interpreted as a prediction 
that is consistent with salient features of the data com- 
prising z. These features, which are cast by the ana- 
lyst as design requirements on the RPM (for example, 
we might want all observed outcomes to be less than 
2-standard deviations from the mean prediction), are 
cast as inequality constraints in the optimization prob- 
lems used to calculate the model. Second, we want to 
quantify the probability that yN+i is compliant with 
such requirements. To continue the example, we want 
to evaluate the probability that yN+i is less than 2- 
standard deviations away from the mean prediction. 


3 INTERVAL PREDICTOR MODELS 

This section introduces concepts from Interval Pre- 
dictor Models (IPM) that are essential for RPMs. Ad- 
ditional information on IPMs and examples are avail- 
able in (Crespo et al. 2014). An IPM is simply a map- 
ping that assigns an output interval for each value of 
the input. In the context of this paper, an IPM assigns 
to each input vector x G X a corresponding outcome 
interval in Y. A parametric IPM is obtained by associ- 
ating to each x G X the set of outputs y corresponding 
to all values of p in P : 

I y (x,P) = {y = p T p(x), p E P}, 


( 1 ) 



where <p(x) is a vector of monomials, and 

P = {p : p < p < p}. (2) 

The analyst is free to choose which monomials are 
relevant to the particular application. A general repre- 
sentation of a multivariate polynomial basis is 

<p(x) = [l,x® 2 ,x* 3 , . . . ,o7 n ] T , (3) 

where x = [xi , . . . , x n J is the state, and the vector 
ij = [ijj, . . . , ij t n x \, with ij f R for j k has the ex- 
ponents of the monomials. 

The limits of the IPM prescribed by (1-3) can be 
explicitly computed as 

I y (x,p,p) = [y(x,p,p), y(x,p,p)], (4) 

where 

y(x,p,p) = p(x) T =^j - <^(|x|) T (5) 

y(x,p,p) = p(x) T + ¥>(M) T y ”) (6) 

Therefore, the envelopes of the interval valued func- 
tion I y , are linear functions of p and p. and piecewise 
polynomial functions of the input. The spread of I y , 
which is the separation between its limits, is 

8 y (x,p,p) = <p(\x\) T (p-p). (7) 

Note that the spread depends on the size of the un- 
certainty box P, but is independent of its geometric 
center. 

Commonly, the DGM is approximated by the Least 
Square (LS) prediction, y = p[ s p(x), where p L s is 
given by 

Pls = {A T A) 1 A T [yi,...,y N ] r , (8) 

Ai,j = <Pj(xi), for i — 1, . . . , N and j — 1, . . . , n p . 

3.1 Type-1 IP Ms 

A Type-1 IPM is given by Equations (1-3) where P 
is the solution to the following Optimization Problem 
(OP). 

Optimization Problem 1 (OP1): The limits of P are given 
by 

(p,p) = argmin {E x [S y (x,pb,pf)} : p a < p b , 

Pb,Pa 


Therefore, a Type-1 IPM yields a P that minimizes 
the expected interval spread such that all the observed 
outputs are within I y {x). When a; is a random vector 
of known distribution, the cost function in (9) can be 
calculated analytically. Otherwise, the sample mean 
based on the data can be used to approximate it. The 
resulting IPM, which is calculated by solving the con- 
vex optimization problem in (9), admits a rigorous 
reliability assessment (see Section 5). This assess- 
ment quantifies the probability that a future observa- 
tion will fall within I y {x). 

The membership of p L s i n P can be ensured by 
replacing the first constraint with p Ll < p LS < p b , 
or adding the constraint p a + p b = 2 p LS . In general, 
the inclusion of these constraints leads to IPMs with 
larger expected spreads, with the equality constraint 
leading to the larger of the two. A formulation result- 
ing from adding either of these two sets of constraints 
will be called Augmented. 


4 RANDOM PREDICTOR MODELS 

A RPM is a mapping that assigns to each input vector 
x G X a corresponding random variable in the output 
space Y. That is, an RPM is a random variable-valued 
map 

R : x — > R y {x ) C Y, ( 10 ) 

where x is the input, and R y (x) is a random process 
whose support lies in Y. A parametric RPM is ob- 
tained by associating to each x G X the set of outputs 
y corresponding to all values of p described by a ran- 
dom vector with joint Cumulative Distribution Func- 
tion (CDF) Fp{p) having the support set P. As before, 
attention will be limited to the case where the output 
is a linear function of the parameter p, and a polyno- 
mial function of x. This leads to 

Ry(x) = {; y = p T (p(x ), p : F p (p), p G P}. ( 11 ) 

Denote by p e M np , v e M np , and c E M n A n P -i)/2 the 

mean, variance and correlation of p respectively. The 
variance and correlation fully prescribe the covari- 
ance matrix C(u,c) E M. n P xn p. It can be shown that 
any random vector with a support set P as in (2) must 
satisfy the consistency equations: 

p < p < p, ( 12 ) 

0 <v < (p-p)0(p~ p), ( 13 ) 


y(xi,p h ,p a ) <yi< y(xi,p b ,p a )} , (9) 

-1 < c < 1, (14) 


where E x [- } is the expected value operator with re- 
spect to the input x, and (x i: y.i) for i = 1, ... ,N are 
the observations in z. 


C(u, c ) A 0. 


(15) 



The symbols 0 and 0 denote the component-wise 
product of vectors, and positive semidefiniteness re- 
spectively. 

The random process R y (x ) is fully prescribed by 
the CDF of p. Naturally, statistics of the output y, 
such as the mean p y (x) = E p [y(x,p)], the variance 
is.y (x) = E p [(y {x, p) - n y (x) )‘- 2 ] , and the range I y (x) = 
[min p y(x,p),max p y(x,p)\, vary with x. In particu- 
lar, the mean prediction is p y (x, p) = p T (p(x), the 
output’s variance is 

u y (x,u,c) = p{x) T C{v,c)ip{x), (16) 

and the output’s range is the interval value function 
(4). When the components of p are uncorrelated, (16) 
reduces to 1 

v y (x,v) = u T (p 2 (x). (17) 

A few metrics for characterizing R y (x) are intro- 
duced next. The cr-surface, which connects all the out- 
puts y that are r standard deviations from the mean 
prediction, is defined by 

l(x,p,r,u,c) = p T <p{x) + rJv y {x,v,c). (18) 

The o- volume, defined as 

I a {x,p,T,v) = [l(x,p,~T,U,c),l(x,p,T,U,c)\ , (19) 

contains all the outputs y that are no more than r stan- 
dard deviations away from p y (x). For the value of r 
to be feasible (i.e., for the cr-surface to be within the 
support of R y (x)), it must satisfy 

y(x,p,p. ) <l{x,p,T,v,c) <y(x,p,p). (20) 

Equation (20) ensures that the support of the process 
contains outcomes that are up to r standard deviations 
from the mean prediction. Note that the range of stan- 
dard deviation values satisfying these inequalities is a 
function of x. 

The formulations that follow prescribe key statis- 
tics of p, thus of the random output y{x), based on 
input-output data. As such they encompass all RPMs 
that conform to these statistics. Four types of RPMs 
are proposed. Type-1 RPMs and Type-2 RPMs, cov- 
ered in detail and exemplified in (Crespo et al. 2015), 
prescribe the mean and variance of p. Conversely, 
Type-3 and Type-4 RPMs also prescribe the range 
of the output. Whereas Type-3 RPMs emphasize the 
tightness of the outputs’ range, Type-4 RPMs em- 
phasize the tightness of the a- volume. In contrast to 
Type-1 and Type-2 RPMs, which only require solving 
a single OP, Type-3 and Type-4 RPMs require solv- 
ing a pair of interdependent OPs. The formulations 
below only consider c = 0. Extensions to the corre- 


1 When the correlation c is zero, the corresponding argument 

of any function depending on it will be dropped. 


lated case can easily be made. Furthermore, the selec- 
tion of p as p\ s is arbitrary, and any other value can 
be used. In the developments that follow, the Perfor- 
mance of an RPM refers to the property evaluated by 
the cost function in the corresponding OP. The section 
that follows covers the essentials of Type- 1 RPMs and 
Type-2 RPMs needed to calculate Type-3 and Type-4 
RPMs. 

4.1 Type-1 RPMs 

Type-1 RPMs prescribe the mean and variance of 
R y (x) when the entire data set in z is used. A Type-1 
RPM is given by Equations (3, 11), where p is a vector 
of uncorrelated random variables with expected value 
p = Pls, and a variance v — v, given by the solution 
to the following program. 

Optimization Problem 2 (OP2): The variance of p is 
equal to 

v = argmin {E x [u y (x, v)\ : l{x u p,-a max ,iy) < y t < 

l(xi,p,a max , v) for i = 1 (21) 

where cr max > 0 is a parameter prescribed by the ana- 
lyst, and (. Xi , yf) for i = 1, . . . , N are the observations 
in z. 

Hence, a Type 1-RPM minimizes the expected vari- 
ance of the random process R y (x) such that all ob- 
servations are no more than <r max standard deviations 
away from the mean prediction, i.e., all observations 
are within the rr-volumc I a (x, p, <j max , v). 

Note that both Type-1 IPMs and Type-1 RPMs re- 
quire solving a convex OP. As such they can effi- 
ciently handle hundreds of thousands of data points, 
thus of input dimensions. This is in sharp contrast 
to Gaussian Processes which are limited to a few 
thousand data points before becoming numerically in- 
tractable. 

A Type-1 RPM does not prescribe the support of p, 
thus, of R y (x). Any random vector satisfying the con- 
sistency Equations (12-15) for p = p LS and v = z> is 
a valid characterization of F p (p). Since Type-1 RPMs 
are calculated by solving a convex OP, they admit a 
rigorous reliability assessment. This assessment, pre- 
sented in Section 5, quantifies the probability that 
a future observation will fall within the rr-volume 
I a {Xi p, CT max , F) . 

4.1.1 Outliers in the Data Set 

The presence of outliers in the data yields undesirably 
large a-volumes and uncertainty sets, diminishing the 
RPMs performance. Whereas the limits of the optimal 
I a might be prescribed by a few observations, the ma- 
jority of them might be much closer to the mean pre- 
diction. The outliers, whose removal from the data set 



will lead to smaller predicted variances, can be iden- 
tified using anyone of several figures of merit. This 
paper will use the figure of merit 


Ki(/x,z/, c) 


(yi- H T p(xj)) 2 
v y (xi,v,c) 


( 22 ) 


where v is the variance of p. n t is a variance- 
normalized distance squared between the ?th observed 
output and the mean prediction at the corresponding 
input. Outliers will be identified by determining the 
data points corresponding to the largest percentiles of 
the empirical CDF of n, F k ^(k), for i — 1, . . . , N, 
i.e., (xi,y.i) is an outlier if F^^nf) > A where 0 <C 
A < 1. Once the outliers are identified, they can be 
removed from the data sequence and a new Type-1 
RPM will be calculated. The resulting RPM will at- 
tain tighter predictions for A fraction of the observa- 
tions in z, while the prediction for the remaining 1 — A 
fraction might be considerably degraded. The outliers 
found by this procedure will be the same regardless of 
the value of cr max . 


4.2 Type-2 RPMs 

A formulation leading to an alternative RPM is pre- 
sented next. In contrast to Type-I RPMs, this approach 
searches for v by using only a fixed percentage of the 
N observations available. The observations compris- 
ing the removed set are worst-case in the sense that 
their removal tightens the optimal a - volume the most. 

In particular, a Type-2 RPM is given by Equations 
(3, 11), where p is a vector of uncorrelated variables 
with expected value p = p LS , and a variance v = z> 
given by the following OP. 

Optimization Problem 3 (OP3): The variance ofp is 
v = argmin {E x [u y (x, v)\ : F k(v) (o 2 ax ) > A } , (23) 

i />0 

where cr max > 0 is a parameter prescribed by the an- 
alyst, F k ( u} is the empirical CDF of k(u) in (22) 
based on the N observations in z, and 0 < A < 1, 
another parameter to be chosen by the analyst, is 
the proportion of observations to be contained by 

2 a (x, p , (J m ax ^ ^(A)). 


OP3 is a non-convex formulation, which for A = 
1 yields the same RPM as OP2. When A < 1, 
a fixed number of observations (outliers) are ne- 
glected as the RPM is being calculated. Outliers 
can be easily identified by finding the data points 
for which F K ^) (nfu)) > A. The points not satis- 
fying this condition, which are the elements of z 
within Irr(x, p, cr max , z>(A)), constitute the sequence w. 
A Type- 1 RPM based on the data sequence w is equiv- 
alent to the Type-2 RPM in (23) based on the data 
sequence z. This relationship enables performing a 
reliability assessment of Type-2 RPMs. This assess- 
ment, presented in Section 5, formally quantifies the 
probability that a future observation will be within 
I a (x,p,o max ,i>( A)). 

4.3 Type-3 RPMs 

Type-3 RPMs not only prescribe the mean and vari- 
ance of p, thus of R y (x ), but also their ranges. Type-3 
RPMs optimize the tightness of both the range of the 
output, and of the cr-volume prioritizing the former. In 
contrast to Type-1 and Type-2 RPMs, the calculation 
of a Type-3 -RPM entails solving a sequence of two 
OPs. 

A Type-3 RPM is defined by Equations (3, 11), 
where p is given by the LS parameter estimate in (8). 
The support set P is prescribed by an Augmented ver- 
sion of (9), and the variance v is the solution to the 
following OP: 

Optimization Problem 4 (OP4): The variance ofp is 

v = argmin {E x [u y (x,u)] : F k(v) (a 2 max ) > A} , (24) 

0 < V m ax 


where v max = (p — p) 0 (p — p), p and p are given by 
(9), and F K ^ is the empirical CDF of n(y) in (22) 
based on the N observations in z. The parameters 
cr max and X, to be set by the analyst and defined earlier, 
must satisfy 


cr n 


> <tZ 


max 

l<i<JV 


| Vi- p T <p(xj)\ 

V U maxP 2 ( X i) 


and 0 < A < 1. 


(25) 


Hence, a Type-2 RPM minimizes the expected vari- 
ance of the random process R y (x) such that a A frac- 
tion of the observations are no more than <r max stan- 
dard deviations apart from the mean prediction. The 
tightening of the prediction for such a fraction yields 
a cr-volume I a (x,p,c r max , z>(A)) that does not enclose 
the remaining 1 — A fraction. This shows that (23) is 
a chance-constraint formulation (Charnes et al. 1958), 
in which one is willing to accept the occurrence of un- 
favorable low-probability events (probability 1 — A) 
for the sake of an improved performance for high- 
probability events (probability A). As with Type-1 
RPMs, cr max is essentially a scaling factor. 


Hence, a Type-3 RPM minimizes the expected vari- 
ance of the random process R y (x) given that (i) the 
cr-volume associated with <r max contains a A fraction 
of the observations, (ii) the relationships among the 
mean, the variance and the support set satisfy the con- 
sistency Equations (12-15), and (iii) the range of out- 
puts Iy(x) has minimal expected spread while con- 
taining all N observations. While Augmented OP1 is 
convex, the first inequality constraint in (24) makes 
OP4 non-convex. Notice that extreme observations 
prescribe the support set P in OP1, whereas the 
floor(iVA) observations attaining the smallest n, val- 
ues prescribe u in OP4. 



The solution to (9) enters (24) via the upper bound 
on v, v max . The constraint (25) ensures the feasible 
design space is non-empty. The ith component of the 
vector at the right hand side of (25) is the value of t, 
for which y t = l(xi, /i, r, , u max ). Hence, r, is the small- 
est number of standard deviations that can separate 
(. Xi,yi ) from the mean prediction without letting v ex- 
ceed r'max- 

When A = 1, the constraints in (24) can be writ- 
ten as a set of convex constraints. When A < 1, the 
constraints in (24) are equivalent to a subset of the 
convex constraints. This subset is given by all the ele- 
ments in z satisfying < A. The floor(iVA) 

observations satisfying this condition constitute the 
data sequence w. Therefore, OP4 based on the data 
sequence z renders the same empirical model as a 
convex-program based on the data sequence w. This is 
the basis used for evaluating the reliability of Type-3 
RPMs. To this end (See Theorem 2), it is useful to de- 
termine if / CT (x,yU,cr max , z>(A)) C I y (x,p,p ) holds, i.e., 
the a- volume associated with cr max is within the range 
of R y (x). This is the case if and only if the Contain- 
ment Condition 

<p(\x\ ) T <3-P) - \p{x) T (p + p- 2p)\- 

2cr m ax y^V(x) > 0, (26) 

holds for all x E X. This semi-infinite constraint can 
be evaluated rigorously using Bernstein polynomials 
and interval analysis. Type-3 RPMs satisfying (26) al- 
low for a tighter reliability assessment. Enforcing this 
condition by design requires incorporating (26) into 
(24). This practice, however, will not be considered 
in this paper. 

Note that Type-3 RPMs for the case in which A = 1 
can be found by solving a sequence of two convex 
OPs. This structure allows considering problems with 
hundreds of thousands of observations. In such a case 
outliers can be dealt with by identifying them and re- 
moving them from the data sequence in advance as 
explained in Section 4.1.1. 

Example 1: Two Type-3 RPMs based on the same 
data sequence used in (Crespo et al. 2015) are 
derived next. Whereas the two RPMs differ in 
the value of A used to calculate z>, both use the 
same set P. This set is calculated via an Aug- 
mented OP1 with p < p LS < p. This leads to 
p = [—12.9837, —1.1488, —0.8339, 0.0013, —0.0379, 

-0.0001, 0.0032] 1 ", and p = [7.2080,-1.1488, 
-0.8339, 0.0013, -0.0379, -0.0001, 0.0034] T . These 
values, in turn, yield an upper bound for v where the 
only significant component is z/ max ,i = 90.8037. This 
IPM’s performance is E x [S y ) = 10.4942. The bound 
on cr max resulting from (25) yields cr* ax = 1.4094. 
Thus, we selected cr max = 1.5. 

A Type-3 RPM for A = 1 is calculated first. There- 
fore, we require that all 150 observations be no more 



Figure 1: RPM D: Type-3 RPM for <j max = 1.5 and A = 1. 

than 1.5 standard deviations from the mean predic- 
tion. The resulting RPM, to be referred to as RPM D, 
leads to a variance z> for which the only significant 
term is z>i = 80.1699. The performance of RPM D is 
given by both E x [S y ] = 10.4942 and E x [u y \ ~ Ty. Fig- 
ure 1 shows RPM D. Whereas the limits of the range, 
I y (x), are shown as dashed lines, cr-surfaces separated 
by 0.5 units are shown as dashed-dotted lines. Note 
that the augmented constraint yielded a skewed ran- 
dom process with respect to its mean prediction. Fur- 
ther notice that the lower limit of the support coin- 
cides with the cr-surface associated with a = —1.5, 
whereas the values of a reaching the upper limit of 
the range vary. Even though the portions of the cr- 
surfaces spreading outside I y (x ) are infeasible (e.g., 
almost the entire a = 1.5 surface), they have been 
plotted for clarity. The feasible range of a values at 
each value of x is given by (20). Because the majority 
of the observations are close to the mean prediction, 
we can expect that neglecting a few extreme obser- 
vations will considerably improve the model’s perfor- 
mance. 

A Type-3 RPM for A = 143/150 is derived next. 
Therefore, we require that 143 observations be no 
more than 1.5 standard deviations from the mean pre- 
diction. This model, to be referred to as RPM E, leads 
to a variance z> for which Ty = 22.2497 is the only 
significant term. This indicates that the CDF of p cor- 
responding to RPM E is about four times more con- 
centrated about the mean than that of RPM D. The 
performance of RPM E is given by E x [S y ] = 10.4942 
as before, and by E x [v y \ ~ . In terms of the latter 

metric, RPM E is 72% better than RPM D. Figure 2 
shows cr-surfaces corresponding to RPM E. The same 
line conventions used before apply. A comparison be- 
tween Figures 1 and 2 indicates that RPM E yields a 
tighter probabilistic description for 100A% of the ob- 
servations than RPM D. The containment condition 
in (26), which will be required to calculate the mod- 
els’ reliability, is s not satisfied for either RPM D or 
RPME. This is reflected in Figures 1 and 2, where the 
cr-surface corresponding to cr max is above y(x. p, p) for 
some x in X. 

The sequential construction of a Type-3 RPM, 



X 

Figure 2: RPM E: Type-3 RPM for a mdx = 1.5 and A = 143/150. 

where the variance v is solved for after solving for the 
support set P, restricts its probabilistic performance 
(i.e., the variance is calculated given an optimal sup- 
port set). This restriction manifests in the lower bound 
(25) to admissible values of cr max . A sequential ap- 
proach reversing the priority order is presented next 
(i.e., P is calculated given an optimal variance). 

4.4 Type-4 RPMs 

A Type-4 RPM is given by by Equations (3, 11), 
where the expected value ft is given by the LS so- 
lution in (8), the variance v is given by (23), and the 
support P is given by the following OP. 

Optimization Problem 5 (OP5): The support set P of the 
random vector p, having expected value // and vari- 
ance v(o maxi A), is given by 

(p,p) =argmin {E x [S y (x,p b ,p a )\ : p a < p < Pb, 

Pb,Pa 

y(xi) <Vi< y(xi ) for i = 1, . . . , N, 

v<{p-p a )& ( Pb ~ d)} ■ (27) 

Hence, a Type-4 RPM minimizes the expected 
spread of the random process given that (i) P con- 
tains //, (ii) the range contains all the observa- 
tions, (iii) the relationship between z> and P satis- 
fies consistency condition (13), and (iv) the a-volume 
I a {x,p, cr max , z>(A)) contains 100A% of the observa- 
tions. Note that the solution to OP3 enters into OP5 
via the lower bound of the last constraint. Further 
notice that OP3, used to calculate z>, is non-convex, 
whereas OP5, used to calculate P, is convex. This 
is the case even though the feasible design space as- 
sociated with the bilinear constraints in (27) is non- 
convex. The equivalence between OP3 and OP2 cov- 
ered in Section 4.2, allows performing a rigorous reli- 
ability analysis of Type-4 RPMs. This analysis quan- 
tifies the probability that a future observation will be 
inside both the a-volume I a (x, p, cr max , z>(A)) and the 
range I y (x,p,p). As before, the containment condi- 



Figure 3: RPM F: Type-4 RPM for A = 1 and tx max = 1. 

tion in (26) plays a key role in the evaluation of the 
model’s reliability. 

Example 2: Next we derive two Type-4 RPMs for 
cr max = 1 and the same data used earlier. The two 
RPMs differ in the value of A used to calculate z>. Be- 
cause c max < a* ydK = 1.4094, there is no Type-3 RPM 
able to satisfy this requirement. This illustrates the 
limitations on the probabilistic performance resulting 
from Type- 3 RPMs. 

The first RPM, to be referred to as RPM F, 
uses A = 1. Hence, we will require that all 
150 observations will be less than one standard 
deviation from the FS prediction. The only sig- 
nificant term in the solution is z>i = 180.3824. 
With 0 available, we then solve for p and p us- 
ing (27). This leads to a support set with limits 
p = [-12.9981,-1.1488,-0.8339,0.0012,-0.0379, 

-0.0006, 0.0032] t , and p = [13.8920,-1.1488, 
-0.8339, 0.0012, -0.0379, 0.0001, 0.0032] T . There- 
fore, whereas the first and sixth component of p vary 
in a range, the other ones can be treated as fixed 
constants. The performance of RPM F is given by 
both E x [u y ] = 180.3824 and E x [8 y ] = 13.4714, which 
are 1620% larger and 83% smaller than those of RPM 
D. Figure 3 shows RPM F. Note that the containment 
condition I a C I y holds for all x e X. The a-volumes 
and limits of I y appear to be centered about the FS 
prediction. This is not the case for other values of 
(7 max (not shown). Because most of the observations 
are close to the mean prediction, it is natural to expect 
that neglecting a few observations will considerably 
improve the model’s performance. 

We now derive a Type-4 RPM for A = 143/150. 
The solution to OP3 indicates that the most signif- 
icant variances are fy = 6.3378, and z> 2 = 3.1509. 
With v available, we then now solve for p and p 
using (27). OP5 yields a support set P with limits 
p = [-10.2605,-3.8559,-0.8480,0.0002,-0.0420, 

-0.0002, 0.0032] t , and p = [2.9670,0.016, 

-0.8200, 0.0022, -0.0338, -0.0001, 0.0032] T . 
Therefore, according to their ranges’ size; the first, 
second and fifth components of p contribute the most 
to the spread in the predicted output. The perfor- 



Figure 4: RPM G: Type-4 RPM for A = 143/150 and a nlax = 1. 

mances of the resulting empirical model, shown in 
Figure 4 and called RPM, are E x [u y ] = 36.3341 and 
E x [S y \ = 12.3649. These values are 246% larger and 
39% smaller than those of RPM E, respectively. The 
containment condition I a C I y , which will be used 
to quantify the models reliability, does not hold at 
x = 0 (not seen in Figure 4). The support set of the 
CDF of pi is not centered about its mean value of 
—0.8734 (Crespo et al. 2015). This causes a sizable 
offset between the mean prediction and the midpoint 
function (y(x) + y(x))/ 2. Further notice that the 
high-probability region of the random process con- 
tains most of the observations whereas the outliers 
only affect I y (x). The limits I v {x), which have a 
derivative discontinuity at x = 0, do not coincide with 
any cr-surface. As expected, the comparison of RPM 
D with RPM F; and of RPM E with RPM G, indicate 
that the improvements in probabilistic performance 
E x [u y ] cause a degradation of the non-probabilistic 
performance E x [%] . 

5 MODEF’S REFIABIFITY 

This section presents a framework for rigorously eval- 
uating the reliability of the predictor models proposed 
above. The reliability of model £, r(£), is the proba- 
bility that a future observation will be compliant with 
the requirements imposed upon the calculation of the 
model. These requirements are cast in terms of an out- 
put y belonging to a a- volume I a (x) for Type-1 and 
Type-2 RPMs, and also to the range I y (x) for Type-3 
and Type-4 RPMs. The developments that follow are 
based on the Scenario Approach (Calafiore & Campi 
2006). 

Denote by P the unknown distribution of the DGM 
from which the points of the data sequence z are ob- 
tained. P can be interpreted as a probabilistic cloud 
in the 1x7 -space. The case in which y is a deter- 
ministic function of x only is a particular case where 
P is concentrated over the function. A general P can 
accommodate situations where the fluctuation in the 
output y is caused by sources other than x. No as- 
sumption is made on P so that the functional form 


relating x and y can be arbitrary. The following the- 
orem, taken from (Campi et al. 2009), permits quan- 
tifying the reliability of an empirical predictor model 
whenever the OP used for its calculation is convex. 

Theorem 1: Let z = { z* } = {(ay,?/*)}, for i = 
1 , ,N, be an independent data sequence resulting 
from a stationary discrete-time data generating pro- 
cess. Suppose the model £ is calculated by solving 
a convex constrained optimization problem having a 
unique solution. Furthermore, assume that k obser- 
vations ( outliers ) out of the N available have been 
discarded when calculating the model. Then, for any 
e G (0, 1) and assuming k < N — d, where d is the 
number of optimization variables used to calculate £, 
it holds that 

Probpiv [r(£) > 1 — e] > 1 — /3, where (28) 

, ( N-d)< 

(N — d)\d) f^(N-d-t)W.(l-ty' ' ' 

The reliability of Type-1 IPMs, to be denoted as X, 
is defined as 

r(X) = Prob P [(x,y) G I y (x,f,p)] . (30) 

The convexity of the OP1 enables the direct applica- 
tion of Theorem 1 . The reliability of Type- 1 and Type- 
2 RPMs, denoted by 1Z, is defined as 

r (77) = Prob P [(x,y) G I a (x, p, a max , z>(A))] . (31) 

The convexity of OP2 enables the direct application of 
Theorem 1 to Type-1 RPMs. This includes the cases 
in which none ( k = 0) and some ( k > 0) of the obser- 
vations are removed from the data set in advance. In 
contrast to OP2, OP3 is non-convex. This opens the 
possibility of (23) having multiple optima. Multiple 
optima may result from the possibility of obtaining 
the same RPM for different sets of outliers. Because 
Type-2 RPMs are calculated by solving a non-convex 
program, Theorem 1 cannot be applied directly. How- 
ever, the reliability of such models can be establish 
by using the Principle of Equivalence. This principle 
is based on identifying an auxiliary convex formula- 
tion that will result in the very same empirical model 
found by solving the non-convex formulation. If this 
is attained, the reliability of the model, which is inde- 
pendent of the means used to calculate it, can be rig- 
orously evaluated via the auxiliary formulation. This 
approach can be applied to Type-2 RPMs. In partic- 
ular, the solution to OP3 using the original the data 
sequence z for a given value of A is equivalent to the 
solution of OP2, which is a convex program, with the 
data sequence w. Because only the N — k* elements 
in w, where 

k* = floor[iV(l — A)], (32) 

are required by the auxiliary program, the reliabil- 


ity of Type-2 RPMs is given by Theorem 1 with 
k = k*. These k* observations fall outside the opti- 
mal cr-volume and satisfy F k ^{k) > A. 

The reliability of Type-3 and Type 4 RPMs is con- 
sidered next. Denote by TZ any of such RPMs. The 
reliability of 7Z is defined as 

r{ TZ) = Prob P [(x, y)eS], (33) 

where5 = I y (x, p,p) n/ 0 -(x,/i,a max ,z>(A)). The fol- 
lowing theorem enables calculating r( TZ). 

Theorem 2: Let z = {zf = {(a^,^)}, for i = 
1 , ,N, be an independent data sequence resulting 
from a stationary discrete-time data generating pro- 
cess and 7 Z be a Type-3 or Type-4 RPM. When the 
containment condition (26) holds, the reliability oflZ 
is given by (28) with d = n. p and k = k* < N — d. 
Otherwise, the reliability oflZ is given by (28) with 
e = ei + 62, where e± is given by (29) for d = 2 n p 
and k = 0; and 62 is given by (29) for d = n p and 
k = k* < N — d. 

Proof When the containment condition holds, the 
two events defining the model’s reliability are depen- 
dent and r(TZ) = Probp \{x,y) G If. In this case the 
reliability is given by Theorem 1 after applying the 
Principle of Equivalence to the non-convex formu- 
lations (24) or (23) to Type-3 and Type-4 RPMs re- 
spectively. In both cases k = k* as defined in (32). 
When set containment does not hold, use the bound 
r{ TZ) > Probp [( x,y ) G I y \ + Probp [(x,y) G If - 1. 
This bound is generally loose, so the actual model’s 
reliability is probably larger. Each of the two events 
in (33) will be considered separately. Since the event 
(x,y) G I y (x) is enforced by solving the convex pro- 
gram in (9) or (27) with N observations, we can read- 
ily bound its probability using Theorem 1 for d = 2 n p 
and k = 0. This leads to e\. Conversely, the event 
(. x,y ) G I a (x) is enforced by solving the non-convex 
programs in (24) for a Type-3 RPM, and (23) for a 
Type-4 RPM. The principle of equivalence enables 
evaluating the probability of this event by considering 
an auxiliary convex program for which k* out of the 
N observations are discarded. This leads to 62- Theo- 
rem 2 results from substituting these expressions into 
Theorem 1. □ 

Example 3: The reliability of RPM D and E, which are 
Type-3 RPMs, is considered first. Since neither model 
satisfies the Containment Condition (26), the reliabil- 
ity of each event must be added. Whereas the first 
event in (33), for which N = 150, k = 0 and d = 14, 
yields 1 — ei = 0.6984 with confidence 1 — (3 = 0.99; 
the second event, for which N = 150, k = 0 and 
d = 7, leads to 1 — e 2 = 0.8050 with the same con- 
fidence. Therefore, the reliability of RPM D is no 
less than 1 — ei — e 2 = l — e = 0.503 with confidence 
1 — (3 = 0.99. In the case of RPM E we have the same 


value for ei as that for RPM D, whereas for the sec- 
ond event, for which N = 150, k = 7 and d — 7, leads 
to 1 — e 2 = 0.6984 with confidence 1 — (3 = 0.99. 
Therefore, the reliability of RPM D is no less than 
1 — ei — e 2 = l — e = 0.3968 with confidence 1-/3 = 
0.99. Hence, discarding seven outliers improved per- 
formance by 74% at the expense of a reduction in the 
reliability of 10%. Finally, we will evaluate the relia- 
bility of RPM F and G, which are Type-4 RPMs. The 
containment condition holds for RPM F but not for 
RPM G. The reliability of RPM F, for which IV = 150, 
k = 0 and d — 7, is no less than 1 — e = 0.8050 with 
confidence 1 — j3 — 0.99. In the case of RPM G, the 
first event in (33), for which N = 150, k = k* = 7 
and d — 7, leads to 1 — e\ — 0.6984 with confidence 
1 — (3 = 0.99; while the second event, for which N = 
150, k — 0 and d = 14, leads to 1 — e 2 = 0.6984 with 
confidence 1 — (3 = 0.99. Therefore, the reliability of 
RPM G is no less than 1 — ei — 62 = 1 — e = 0.3968 
with confidence 1 — f3 = 0.99. The 30% reliability re- 
duction of RPM G relative to RPM F is affected by the 
conservatism in Theorem 2. This illustrates the ben- 
efits of satisfying the containment condition. These 
results illustrate the typical trade-off between perfor- 
mance and reliability. These figures of merit should be 
traded off until the desired balance is reached. This 
balance can be reached by increasing the number of 
observations N, of outliers via A, or by changing the 
model’s structure via n p , which prescribes d. 

6 CONCFUSIONS 

This and the companion paper (Crespo et al. 2015) 
present techniques for constructing random predictor 
models via optimization. These models enable a rigor- 
ous characterization of key features of the prediction, 
and of its reliability. Models with various degrees of 
fidelity are developed. This mathematical framework 
sets forth a new paradigm for the construction of em- 
pirical models in which the model’s performance and 
reliability can be rigorously evaluated and traded-off. 
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